-
Notifications
You must be signed in to change notification settings - Fork 920
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Preserve order if necessary when deduping categoricals internally #11597
Preserve order if necessary when deduping categoricals internally #11597
Conversation
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## branch-22.10 #11597 +/- ##
===============================================
Coverage ? 86.41%
===============================================
Files ? 145
Lines ? 22998
Branches ? 0
===============================================
Hits ? 19874
Misses ? 3124
Partials ? 0 Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
rerun tests |
.drop_duplicates(ignore_index=True) | ||
._column | ||
) | ||
new_cats = dedup_preserve_order(cudf.Series(new_cats))._column |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have dedup_preserve_order
operate on a Column
rather than a Series
? As much as possible, we want to avoid this pattern of constructing a Series
out of a column, invoking a Series operation on it, and then getting the column back out of the result.
Having it operate on a column would have the added benefit that we can then make it a method of CategoricalColumn
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shwina those are all good points, I'll update as such.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added this to ColumnBase
in the end as I ended up not being able to see any reason all columns shouldn't share the capability. Let me know if this should be changed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
@@ -217,6 +217,16 @@ def dropna(self, drop_nan: bool = False) -> ColumnBase: | |||
# The drop_nan argument is only used for numerical columns. | |||
return drop_nulls([self])[0] | |||
|
|||
def _dedup_preserve_order(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this just be merged with the unique()
method? I'm thinking col.unique()
returns unique values in arbitrary order, and col.unique(preserve_order=True)
does this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed this offline and came to the conclusion that long term we should implement #11638 to back this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we still collapse this _dedup_preserve_order()
method into unique()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
@gpucibot merge |
Closes #11486